1. Introduction

a. Goal of project

  • In this project, we gather data from a variety of sources and file formats, then assess it visually and programmatically for quality and tidiness. After the assessment, we clean the data, merge it into a single dataset, and store the cleaned result in a CSV file. We then analyze the wrangled data to produce insights.
  • The analysis addresses the following questions:

    • Question 1: Which dog name is the most popular? How does it change over time, and how does name relate to rating_numerator, favorite_count, and retweet_count?

    • Question 2: How does stage relate to the number of tweets, favorite_count, and retweet_count?

    • Question 3: Which breed is the most popular, and how does breed relate to rating_numerator, favorite_count, and retweet_count?

    • Question 4: What is the relationship between favorite_count, retweet_count, and rating_numerator?

    • Question 5: At what times are tweets posted, retweeted, or marked as favorites?

b. Datasets overview

We have three datasets to read, twitter-archive-enhanced.csv, image_predictions.tsv, and tweet_json.txt, in three different file formats (CSV, TSV, and JSON stored as text).

Data source

  • Dataset 1: twitter-archive-enhanced.csv
    • Source: Downloaded directly from the link provided by Udacity
    • Method of gathering: Manual download of the file twitter-archive-enhanced.csv
  • Dataset 2: image_predictions.tsv
    • Source: The link provided by Udacity (the file can also be downloaded directly)
    • Method of gathering: Programmatic download via Requests to obtain the file image_predictions.tsv
  • Dataset 3: tweet_json.txt
    • Source: Twitter API
    • Method of gathering: Tweepy, using twitter_api.py from Udacity
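
As a rough illustration of how the third dataset could be read, the sketch below parses tweet_json.txt into a table of engagement counts. It assumes the file holds one JSON tweet object per line and that each object carries the id, favorite_count, and retweet_count fields. The function name load_tweet_json and the output column names are hypothetical, not taken from the project code.

```python
import json

import pandas as pd


def load_tweet_json(lines):
    """Build a DataFrame from an iterable of JSON strings, one tweet per line.

    Assumption: each line is a complete tweet object as returned by the API.
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        rows.append(
            {
                "tweet_id": tweet["id"],
                "favorite_count": tweet.get("favorite_count"),
                "retweet_count": tweet.get("retweet_count"),
            }
        )
    return pd.DataFrame(rows, columns=["tweet_id", "favorite_count", "retweet_count"])


# Possible usage with the real file (the path is an assumption):
# with open("tweet_json.txt", encoding="utf-8") as fh:
#     tweet_counts = load_tweet_json(fh)
```

Passing an open file handle works because iterating over a text file yields one line at a time.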

c. Data wrangling

In this part we perform every phase: gathering, assessing, and cleaning the data. The wrangled data is then visualized and analyzed to extract insights.
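
The step that combines the cleaned tables could look roughly like the sketch below. It assumes that all three tables share a tweet_id column and that only tweets with an image prediction are kept. The function name merge_datasets and the join types are illustrative choices, not necessarily the ones used in this project.

```python
import pandas as pd


def merge_datasets(archive, predictions, tweet_counts):
    """Join the three cleaned tables on tweet_id into one master table."""
    # Assumption: keep only archive rows that have an image prediction.
    merged = archive.merge(predictions, on="tweet_id", how="inner")
    # Assumption: keep rows even if the API returned no counts for them.
    merged = merged.merge(tweet_counts, on="tweet_id", how="left")
    return merged
```

The merged table can then be written out with `master.to_csv("twitter_archive_master.csv", index=False)`, where master is the returned DataFrame.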

The next part presents some interesting conclusions that give us a better view of the data. The data tells us a lot, not only about which dogs are posted on Twitter and which kinds are most liked, but also about the predicted breed of each dog and its stage. So from images posted on Twitter we can learn about the dogs.

But that is not all. We can also learn a lot about the users who post on Twitter, such as how they tweet, retweet, or like images. The data also reveals when they were active on Twitter. Given the short time available for this project, only some of this information could be explored. With more time, we could perhaps find more interesting things, and not only about the dogs.

3. Exploratory Data Analysis

Research Question 1: Which dog name is the most popular? What is the relationship between name and rating_numerator, favorite_count, and retweet_count?

     name_    rating_numerator_mean  rating_numerator_count  rating_denominator_mean  rating_denominator_count  favorite_count_mean  retweet_count_mean
168  Charlie              11.636364                      11                     10.0                        11         10497.090909         2804.545455
206  Cooper               11.300000                      10                     10.0                        10          6939.700000         1880.100000
637  Oliver               11.300000                      10                     10.0                        10          7542.700000         1956.600000
541  Lucy                 11.555556                       9                     10.0                         9         11744.555556         3767.777778
887  Tucker               12.000000                       9                     10.0                         9          8622.777778         2175.888889
919  Winston              10.500000                       8                     10.0                         8          9693.750000         2674.875000
664  Penny                10.875000                       8                     10.0                         8         12681.250000         3908.125000
751  Sadie                10.000000                       7                     10.0                         7          3599.142857         1130.571429
529  Lola                 10.857143                       7                     10.0                         7          8248.857143         2372.428571
870  Toby                 10.571429                       7                     10.0                         7          8419.571429         2511.571429

After cleaning the data and storing it in the twitter_archive_master.csv file, we carry out a short analysis of the dogs. The image below presents the most common names of dogs posted on Twitter. The accompanying chart includes further information for each name.

Conclusion 1:

The charts show that Charlie is the most popular name among dogs posted on Twitter, followed by Cooper and Oliver. However, there appears to be no clear relationship between a dog's name and its favorite_count or retweet_count.
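
A per-name summary like the table above could be computed along these lines. This is a minimal sketch using pandas named aggregation. It assumes the master table has name, rating_numerator, favorite_count, and retweet_count columns, and summarize_names is a hypothetical helper rather than the project's actual code.

```python
import pandas as pd


def summarize_names(df, top=10):
    """Count tweets per dog name and average the rating and engagement columns."""
    summary = (
        df.groupby("name")  # rows with a missing name are dropped by default
        .agg(
            rating_numerator_mean=("rating_numerator", "mean"),
            rating_numerator_count=("rating_numerator", "count"),
            favorite_count_mean=("favorite_count", "mean"),
            retweet_count_mean=("retweet_count", "mean"),
        )
        .sort_values("rating_numerator_count", ascending=False)
        .reset_index()
    )
    return summary.head(top)
```

Sorting by the count column puts the most frequently posted names first, so the first row answers the "most popular name" question directly.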

Question 2: Stage and its relationship with tweet number, favorite_count and retweet_count

Conclusion 2:

Pupper is the stage most often posted on Twitter, at about 66%, followed by doggo and puppo. Interestingly, although pupper is posted most often, it has the lowest average rating as well as the lowest favorite_count and retweet_count. In contrast, doggo and puppo have the highest ratings, favorite_count, and retweet_count, whether they appear alone or are grouped together. Unfortunately, there are very few pictures of them.
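
The share of each stage and its average engagement could be derived as in the following sketch. It assumes a stage column in which tweets without a stage are stored as missing values, so the percentages are relative to tweets that have a stage. The helper name stage_summary and the share_percent column are hypothetical.

```python
import pandas as pd


def stage_summary(df):
    """Return the percentage share and mean rating/engagement for each stage."""
    # value_counts ignores missing stages, so shares sum to 100 over known stages.
    share = (
        df["stage"].value_counts(normalize=True).mul(100).rename("share_percent")
    )
    means = df.groupby("stage")[
        ["rating_numerator", "favorite_count", "retweet_count"]
    ].mean()
    return means.join(share).sort_values("share_percent", ascending=False)
```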

Question 3: Favourite breed? Its relationship with rating_numerator, favorite_count, retweet_count

Conclusion 3:

The chart shows Golden Retriever as the most posted breed on Twitter. However, this breed attracts little interest, judging by its low favorite_count and retweet_count. The same holds for Labrador and Pembroke. The breeds that gain the most attention are Toy Poodle and Miniature Pinscher, respectively. Besides them, Chihuahua is also a well-liked breed, ranking in the top 3 for number of tweets, favorite_count, and retweet_count.

Question 4: Relationship between favorite_count, retweet_count and rating_numerator

Conclusion 4:

These charts give a good indication that favorite_count, retweet_count, and rating_numerator are closely related. Each of the pairwise relationships is positive and quite strong, particularly the one between favorite_count and retweet_count.
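
One way to put numbers on these pairwise relationships is a correlation matrix. The sketch below uses the Pearson coefficient, which is an assumption on our part since the report relies on charts only. The helper name engagement_correlations is hypothetical.

```python
import pandas as pd

ENGAGEMENT_COLUMNS = ["rating_numerator", "favorite_count", "retweet_count"]


def engagement_correlations(df):
    """Return the pairwise Pearson correlations of the three numeric columns."""
    return df[ENGAGEMENT_COLUMNS].corr(method="pearson")
```

Values close to 1 indicate a strong positive linear relationship. Pearson correlation is sensitive to outliers, so a few viral tweets could dominate the result.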

Question 5: Which time are tweets posted or retweeted, or marked favorite?

Conclusion 5:

The data shows that people tweet most in January and February. October through December record the lowest average number of tweets. However, retweets peak in June, when the number of tweets is lowest.

Another interesting finding is that, within a week, people tend to tweet on Tuesday rather than on the weekend, and the retweet and favorite counts then peak on Wednesday.

Meanwhile, people tend to tweet at night and in the afternoon, particularly at 1:00 a.m. and 4:00 p.m. From 6:00 to 14:00 no tweets are recorded, but favorite_count and retweet_count are still documented. In contrast, the chart shows that at 6 a.m. favorite_count and retweet_count increase dramatically, then decrease gradually until 19:00, and then increase again.
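
The month, weekday, and hour breakdowns discussed above could be derived from the timestamp column as in this sketch. It assumes timestamp values in the "YYYY-MM-DD HH:MM:SS" form and a tweet_id column to count. The helper names add_time_features and activity_by are hypothetical.

```python
import pandas as pd


def add_time_features(df):
    """Return a copy of df with month, weekday, and hour columns added."""
    out = df.copy()
    ts = pd.to_datetime(out["timestamp"])
    out["month"] = ts.dt.month  # 1 = January ... 12 = December
    out["weekday"] = ts.dt.day_name()  # e.g. "Tuesday"
    out["hour"] = ts.dt.hour  # 0-23, in the time zone of the stored timestamp
    return out


def activity_by(df, unit):
    """Count tweets and average engagement per month, weekday, or hour."""
    return (
        add_time_features(df)
        .groupby(unit)
        .agg(
            tweet_count=("tweet_id", "count"),
            favorite_count_mean=("favorite_count", "mean"),
            retweet_count_mean=("retweet_count", "mean"),
        )
    )
```

For example, `activity_by(master, "hour")` gives one row per hour of the day. The hour is taken in whatever time zone the stored timestamp uses, so it may differ from the posters' local time.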

4. Conclusion

The data gives us a very interesting view of the dogs and of the way people use Twitter. In this project, we had the chance to collect data from the web, do some challenging data cleaning, and carry out some interesting analysis. Fortunately, the wrangled data let us answer some of the questions we set out with.

Of course, much interesting information remains hidden. With more time, we can explore it more effectively.

Some interesting pictures

  • Row 1680
    • text: Here's a super supportive puppo participating ...
    • Unnamed: 0: 1744
    • tweet_id: 822872901745569793
    • timestamp: 2017-01-21 18:26:02
    • name: NaN
    • breed_prediction: Lakeland Terrier
    • is_one_dog: True
    • stage: puppo
    • rating_numerator: 13.0
    • rating_denominator: 10.0
    • favorite_count: 132810.0
    • retweet_count: 48265.0
    • img_num: 1
    • source: Twitter for iPhone
    • jpg_url: https://pbs.twimg.com/media/C2tugXLXgAArJO4.jpg
    • retweeted_status_id: NaN
  • Row 16
    • text: Oh my. Here you are seeing an Adobe Setter giv...
    • Unnamed: 0: 16
    • tweet_id: 666102155909144576
    • timestamp: 2015-11-16 03:55:04
    • name: NaN
    • breed_prediction: English Setter
    • is_one_dog: True
    • stage: NaN
    • rating_numerator: 11.0
    • rating_denominator: 10.0
    • favorite_count: 81.0
    • retweet_count: 16.0
    • img_num: 1
    • source: Twitter for iPhone
    • jpg_url: https://pbs.twimg.com/media/CT54YGiWUAEZnoK.jpg
    • retweeted_status_id: NaN